\chapter{Call-site manager}
\label{chap-call-site-manager}

The contents of this chapter is a summary and an update of the paper
we published in ELS 2021 \cite{DBLP:conf/els/Strandh21}.

The default calling conventions are described in
\refChap{chap-calling-conventions}.  In summary:

\begin{itemize}
\item we do not use callee-saves registers,
\item the caller sets up the stack frame for the callee,
\item all arguments are passed on the stack on top of the callee stack
  frame, and
\item return values beyond the first one are returned on the stack.
\end{itemize}

Parsing the arguments is done by a part of the callee function
called the \emph{prelude}.  The result of this action is to initialize
lambda-list parameters stored either in registers or in the stack
frame.  The last action of the prelude is to remove the arguments from
the stack.

If the call site consists of the name of a function followed by forms
that compute the arguments, then the code for the default calling
conventions is not directly present in the caller.  Instead, the call
site consists of a single unconditional \texttt{jump} instruction.
The target of this instruction is called a \emph{trampoline snippet}
or just a \emph{snippet} for short.  The call-site manager allocates a
new snippet when required, and alters the target of the \texttt{jump}
instruction.  For a general call, the snippet contains the
instructions of the default calling conventions, and those are
generated by the call-site manager.  The last instruction of the
snippet is another unconditional \texttt{jump} instruction that jumps
back to the instruction following the first \texttt{jump} instruction.

The snippet is recomputed whenever the callee is updated, such as when
a call to \texttt{(setf fdefinition)} is made.  The snippet is thus
customized for the callee in the following ways:

\begin{enumerate}
\item The static environment is a constant and does not have to be
  loaded from the rack.  If the callee does not close over any
  variables, the static environment does not have to be accessed at
  all.
\item The frame size of the callee is a constant and does not
  have to be loaded from the rack
\item The entry point is a constant and does not have to be loaded
  from the rack.
\end{enumerate}

Already, these constants save many memory operations, thereby making
the call more efficient.  However, the prelude of a function is
followed by a \emph{general body} of the function, and that body
contains instructions for communicating with the debugger as described
in \refChap{chap-debugger}.  So the call-site manager uses the entry
point of the prelude only when a breakpoint has been set in the
callee.

The compiler generates an optimized function body to be used when
there are no breakpoints.  This optimized function body is not
preceded by a prelude for parsing arguments.  Instead, the call-site
manager uses information provided in the callee to access arguments in
the caller and puts them in the locations expected by the optimized
function body.  Thus, there is no argument parsing performed by the
callee, which is particularly advantageous when the callee has
optional and/or keyword parameters and the call-site provides those
arguments in the most common way.%
\footnote{So that keyword arguments are pairs, each one of which is a
  symbol in the \texttt{keyword} package followed by a value.}
In that case, the call-site manager generates code to access the
relevant argument directly and put it directly where the optimized
function body of the callee expects it.

When a function is redefined, first every trampoline snippet that calls
that function is updated so that it uses the default argument-parsing
entry point of the function.  This entry point is always valid in each
function, so redefining the function will cause no problem.  After the
function has been updated, each of these trampoline snippets is again
updated according to the optimized entry point of the new function.

Section 3.2.2.3 of the \commonlisp{} standard allows for an
implementation to consider (in some situations, and for reasons of
performance) a named function call with a globally defined name to
mean something more specific than, or even completely different from,
a call to the function return by a call to \texttt{fdefinition} with
that name.  We do not intend to take advantage of this possibility
given by the standard, simply because the call-site manager will make
the call very efficient whatever function it refers to.  Furthermore,
the call-site manager will be able to turn a tail-recursive call into
the equivalent of a loop.  And taking advantage of this freedom
allowed by the standard will make interactive development and
debugging harder.

Initial versions of \sysname{} will not include a call-site manager.
Each call site will contain the general call sequence described
above.

\section{Unfinished suggestion of further optimizations}

This section contains a fairly radical suggestion for further
optimizations that involve the call-site manager.

For the purpose of this section, let's call a register that needs to
be preserved around a function call a \emph{precious} register.

\begin{itemize}
\item Precious registers are not saved by the caller itself.  Instead,
  this becomes the responsibility of the trampoline snippet created by
  the call-site manager.  The call-site contains a list of pairs, each
  consisting of a register and a stack location, where the register is
  one that needs to be preserved and the stack location is where the
  register should be stored, should storing be required.  In the most
  general case, the call-site manager creates a trampoline snippet
  that stores each register in the list in its stack location before
  calling the callee, and restores the registers after the callee
  returns.
\item Each function contains a set (represented as a bitmap) of the
  registers it itself uses, and the registers used by all of its
  callees.  This information is basically the union of the similar
  information of all its callees, and the set of registers the
  function itself uses.
\item For a given pair of a caller and a callee, the trampoline
  snippet needs to save and restore a caller register that must be
  preserved across the call only if that register is used by the
  callee, directly or indirectly, as indicated by the bitmap of the
  callee.
\item When the compiler generates code for a function, the register
  allocator chooses registers randomly, rather than systematically in
  some (perhaps numerical) order.  This way, the likelihood that a
  register that is used by a caller is also used by its callee is
  lowered.
\end{itemize}

We believe that this technique will avoid many register spills across
calls, and therefore it will save many memory accesses.

The main difficulty of this suggestion is what happens when a function
is redefined, especially if one of its callers is already on the
invocation stack.  The new definition will use a different set of
registers than the old definition, and it becomes very possible that
one such register is also used by a caller, and that this register was
not saved by the trampoline snippet.

At the very least, the stack would have to be traversed, and registers
would have to be saved to stack locations, according to the set of
pairs, each consisting of a register and a stack location, of each
call site.  But the trampoline snippet does not contain code for
reloading such registers after the callee returns, so a new snippet
would also have to be created.  It would even be possible to save
every register in the list of pairs and create code in the snippet for
restoring every such register.  Doing so would not be optimal, but no
worse than the default.  So perhaps the snippet could contain two
addresses to return to, one that restores only registers saved, and
one that restores all registers that need to be preserved across a
call.  Then when a function is modified, the stack could be scanned
and the return addresses modified.  A better suggestion would be for
the snippet to contain enough space for instructions to restore all
registers that need to be preserved around a function call, but by
default only instructions to restore registers used by the callee are
present.  When a function is redefined that has callers on the stack,
all registers that need to be preserved around a function call are
saved, and then the snippet is altered to restore all those registers.

Another (minor) difficulty is the stack-scanning part of the garbage
collector.  It would have to scan the stack from bottom to top in
order to determine which registers are live (and contain \commonlisp{}
objects) to trace.

This idea might be too radical, though.  A typical function may have
many of callers, so a lot of work may have to be done when such a
function is redefined.  The following idea might avoid this problem,
though.

It is never an error to overestimate the set of registers used by a
function.  If a function that is likely to take considerable time to
execute (where the time includes the execution time of its callees)
can be marked as using all its registers.  The overhead of a caller to
save all its precious registers is then small.  So the compiler can
estimate the execution time of a function and if it is sufficiently
large, it can mark it as having all its registers used.  Then, only
simple functions will have a smaller set indicated as used.  This
suggestion will limit the propagation of register sets so that is
might be more reasonable.  The effect will then be the same as that of
inlining simple functions, but without the problems of inlining (i.e.,
that the function doing the inlining must be recompiled when the
callee is redefined).

% LocalWords:  callee callees allocator inlining
